UWAGGS

Big Data Cup Starter

2024-02-05

Agenda

Plan for Big Data Cup

Rules and schedule

https://www.stathletes.com/big-data-cup/

There will be 2 categories for participants:

Teams can be 1-4 participants.

Finalists will be selected* and will have the opportunity to present their findings to our panel of sport executives at our reception at Rotman on April 19th.

Prizes will be awarded to top qualifiers.

Participation in the Big Data Cup competition is open to any individual regardless of background, experience, previous analysis, or public work.

2024 Timeline and Key Dates

All are encouraged to submit a written report that will be due March 8th, 2024.

Maximum 6 pages, including figures (size limit 10GB on submission).

Submissions can be emailed to: bigdatacup@stathletes.com with subject line: Big Data Cup 2024.

Please note that email size is limited to 25MB. To send larger submissions (up to 10GB), use Dropbox, Google Drive, or another file-sharing service and include the link in your submission email.

Rules and schedule

There are two research areas

Please identify in the title which area you are focusing on:

Substantive knowledge

You should know a little bit about hockey before you start. If you know soccer/football, you’re well on your way. Here’s a football-hockey translation guide I wrote in 2020.

https://www.stats-et-al.com/2020/08/soccer-to-hockey-translation-guide.html

Getting the data

From https://github.com/bigdatacup/Big-Data-Cup-2024, we download BDC_2024_Womens_Data.csv and load it:

bdc = read.csv("BDC_2024_Womens_Data.csv")
head(bdc)
##         Date             Home.Team      Away.Team Period Clock
## 1 2023-11-08 Women - United States Women - Canada      1 20:00
## 2 2023-11-08 Women - United States Women - Canada      1 19:57
## 3 2023-11-08 Women - United States Women - Canada      1 19:54
## 4 2023-11-08 Women - United States Women - Canada      1 19:52
## 5 2023-11-08 Women - United States Women - Canada      1 19:50
## 6 2023-11-08 Women - United States Women - Canada      1 19:50
##   Home.Team.Skaters Away.Team.Skaters Home.Team.Goals Away.Team.Goals
## 1                 5                 5               0               0
## 2                 5                 5               0               0
## 3                 5                 5               0               0
## 4                 5                 5               0               0
## 5                 5                 5               0               0
## 6                 5                 5               0               0
##                    Team              Player           Event X.Coordinate
## 1        Women - Canada Marie-Philip Poulin     Faceoff Win          100
## 2        Women - Canada   Jocelyne Larocque   Puck Recovery           50
## 3        Women - Canada   Jocelyne Larocque            Play            3
## 4        Women - Canada         Renata Fast            Play            6
## 5        Women - Canada        Emma Maltais Incomplete Play           48
## 6 Women - United States       Hilary Knight        Takeaway          141
##   Y.Coordinate Detail.1 Detail.2 Detail.3 Detail.4            Player.2
## 1           42 Backhand                                   Taylor Heise
## 2           10                                                        
## 3           59 Indirect                                    Renata Fast
## 4           21   Direct                                   Emma Maltais
## 5            2   Direct                            Marie-Philip Poulin
## 6           72                                                        
##   X.Coordinate.2 Y.Coordinate.2
## 1             NA             NA
## 2             NA             NA
## 3              4             35
## 4             48              2
## 5             62             28
## 6             NA             NA

EDA

table(bdc$Event)
## 
##     Dump In/Out     Faceoff Win            Goal Incomplete Play   Penalty Taken 
##             591             209              20             773              37 
##            Play   Puck Recovery            Shot        Takeaway      Zone Entry 
##            2333            2266             403             287             540
bdc_shots = subset(bdc, Event == "Shot")

EDA

The coordinates are already adjusted for possession. For example, all the shots are between 125 and 200 feet in the x-coordinate, suggesting they are all taken from the attacking zone of whoever is shooting.

plot(bdc_shots$X.Coordinate, bdc_shots$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
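
One way to check the "all shots are in the attacking zone" claim numerically is to bin x-coordinates into zones. The sketch below assumes blue lines at x = 75 and x = 125, as drawn by the abline() calls in these slides; `zone_of` is a hypothetical helper and the coordinates are made up.

```r
# Classify x-coordinates (feet from the defensive end) into rink zones.
# Assumption: blue lines at x = 75 and x = 125, as in the abline() calls
# on these slides. zone_of() is a hypothetical helper, not part of the data.
zone_of <- function(x) {
  cut(x,
      breaks = c(0, 75, 125, 200),
      labels = c("defensive", "neutral", "offensive"),
      right = FALSE,          # intervals are [0,75), [75,125), [125,200]
      include.lowest = TRUE)  # keeps x = 200 inside the last interval
}

# Made-up coordinates for illustration only
toy_x <- c(10, 100, 125, 150, 200)
zones <- zone_of(toy_x)
table(zones)
```

If the claim holds, `table(zone_of(bdc_shots$X.Coordinate))` should place every shot in the offensive bin.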

EDA

Other plays don’t have this location restriction; plays can happen anywhere on the ice.

# To do: Draw a whole rink overlay
plot(bdc$X.Coordinate, bdc$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
abline(v=c(75,11,0,100), lwd=3, col=c("Blue","Red","Black","Black"))
abline(h=c(0,85), lwd=3, col="Black")
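
Since the same overlay lines are drawn on several slides, they could be bundled into one function as a first step toward a fuller rink overlay. This is only a sketch: `add_rink_lines` is a hypothetical name, the line positions and colours are copied from the abline() calls above, and the points are made up.

```r
# Hypothetical helper bundling the overlay calls used on these slides.
# Positions (feet) and colours are copied from the abline() calls above.
add_rink_lines <- function(lwd = 3) {
  abline(v = c(75, 125), lwd = lwd, col = "Blue")       # blue lines
  abline(v = c(11, 189), lwd = lwd, col = "Red")        # goal lines
  abline(v = c(0, 100, 200), lwd = lwd, col = "Black")  # end boards and center line
  abline(h = c(0, 85), lwd = lwd, col = "Black")        # side boards
  invisible(NULL)
}

# Toy points for illustration only (not from the BDC file)
toy <- data.frame(x = c(30, 100, 160), y = c(20, 42.5, 60))
plot(toy$x, toy$y, xlim = c(0, 200), ylim = c(0, 85),
     xlab = "x from defensive end (feet)", ylab = "y (feet)")
add_rink_lines()
```

Any of the scatterplots in this section could then be followed by a single `add_rink_lines()` call.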

EDA

Zone entries are recorded in this data set:

bdc_zone = subset(bdc, Event == "Zone Entry")
plot(bdc_zone$X.Coordinate, bdc_zone$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
abline(v=c(75,11,0,100), lwd=3, col=c("Blue","Red","Black","Black"))
abline(h=c(0,85), lwd=3, col="Black")

EDA

Including dump-ins

bdc_zone = subset(bdc, Event %in% c("Dump In/Out", "Zone Entry"))
plot(bdc_zone$X.Coordinate, bdc_zone$Y.Coordinate, xlim=c(0,200), ylim=c(0,85))
abline(v=c(125,189,200), lwd=3, col=c("Blue","Red","Black"))
abline(v=c(75,11,0,100), lwd=3, col=c("Blue","Red","Black","Black"))
abline(h=c(0,85), lwd=3, col="Black")

What can we do?

Reapplying the same analyses

Possible sources of inspiration

K-means clustering (NHL)

Today we’re going to apply the principles of Chapter 9.6, K-means in R, of the DSCI 100 textbook (https://ubc-dsci.github.io/introduction-to-datascience/clustering.html#k-means-in-r) to a data frame from the 2016-17 regular season of the National Hockey League.

df_shots
## # A tibble: 66,771 × 50
##    season gcode refdate event period seconds etype    a1    a2    a3    a4    a5
##     <dbl> <dbl>   <dbl> <dbl>  <dbl>   <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2.02e7 20001    5399     8      1      71 SHOT   2964  1836  1851  1743  2211
##  2 2.02e7 20001    5399    19      1     173 SHOT   2254  2930  2004  2443  2514
##  3 2.02e7 20001    5399    28      1     241 SHOT   2964  1836  1851  2443  2211
##  4 2.02e7 20001    5399    32      1     286 SHOT   2964  1836  1851  1743  2514
##  5 2.02e7 20001    5399    49      1     406 SHOT   2043  2622  1021  2443  2514
##  6 2.02e7 20001    5399    52      1     450 SHOT   2964  1836  1851  2966  2211
##  7 2.02e7 20001    5399    59      1     509 SHOT   2254  2930  2004  2443  2211
##  8 2.02e7 20001    5399    62      1     540 SHOT   2910  2912  2965  2966  2211
##  9 2.02e7 20001    5399    75      1     616 SHOT   2254  2930  2004  2211  2514
## 10 2.02e7 20001    5399    76      1     625 SHOT   2254  2930  2004  2211  2514
## # … with 66,761 more rows, and 38 more variables: a6 <dbl>, h1 <dbl>, h2 <dbl>,
## #   h3 <dbl>, h4 <dbl>, h5 <dbl>, h6 <dbl>, ev.team <chr>, ev.player.1 <dbl>,
## #   ev.player.2 <dbl>, ev.player.3 <dbl>, distance <dbl>, type <chr>,
## #   homezone <chr>, xcoord <dbl>, ycoord <dbl>, awayteam <chr>, hometeam <chr>,
## #   home.score <dbl>, away.score <dbl>, event.length <dbl>, away.G <dbl>,
## #   home.G <dbl>, home.skaters <dbl>, away.skaters <dbl>,
## #   adjusted.distance <lgl>, shot.prob.distance <lgl>, …

K-means clustering (NHL)

This data frame consists of every officially recorded shot in 1226 of the 1230 games of the regular season. (The play-by-play details of the remaining 4 games are not available at the source, nhl.com). Scoring officials recorded 66,771 shots during these games. For most of these shots, we know, among other variables:

K-means clustering (NHL)

There are many, many questions we can answer using this data, including

K-means clustering (NHL)

For now we are going to narrow our focus to two questions that K-means clustering can answer: Are there archetypal shot locations, and if so, where are they? Taking a scatterplot of the raw data, we see some patterns and challenges:

gr1 <- ggplot(df_shots, aes(x = xcoord,  y = ycoord)) +
  geom_point() +
  xlab("x from center ice (feet)") +
  ylab("y from center ice (feet)")
  
plot(gr1)

K-means clustering (NHL)

First, there are far too many points to meaningfully visualize many details and trends. Over the course of a season, shots come from almost everywhere, but there is a dramatic drop-off in density between \(x=-25\) and \(x=25\). The following diagram of NHL rink dimensions can explain why: players almost always take shots between the goal and the blue line.

K-means clustering (NHL)

We can simplify our analysis by examining the absolute value of the x-coordinates instead of the x-coordinates themselves. We also flip the y-coordinate so that we’re doing a rotation around center ice and things like left- and right-wing mean the same thing regardless of the side of the ice.

to_flip <- which(df_shots$xcoord < 0)

df_shots$xcoord <- abs(df_shots$xcoord)
df_shots$ycoord[to_flip] <- -df_shots$ycoord[to_flip]
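
To see that this is a 180-degree rotation about center ice, \((x, y) \to (-x, -y)\) for shots with \(x < 0\), here is the same logic on three made-up shots (the `toy` data frame is illustrative only):

```r
# Toy shots in NHL coordinates (feet from center ice); values are made up.
toy <- data.frame(xcoord = c(-80, 60, -35), ycoord = c(10, -15, -25))

# Rotate shots with x < 0 by 180 degrees about center ice: (x, y) -> (-x, -y).
# Shots with x >= 0 are left alone.
to_flip <- which(toy$xcoord < 0)
toy$xcoord <- abs(toy$xcoord)
toy$ycoord[to_flip] <- -toy$ycoord[to_flip]

toy
```

The first and third shots move to (80, -10) and (35, 25); the second stays at (60, -15).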

K-means clustering (NHL)

We can also use a graphical method called contour plotting to visualize the many overlapping data points and better see locations of high and low shot density. In the following graph, the brightly coloured regions are the locations of the highest shot density.

gr2 <- ggplot(df_shots, aes(x = xcoord, y = ycoord)) +
        geom_density_2d_filled() +
        xlab("x from center ice (feet)") +
        ylab("y from center ice (feet)")

plot(gr2)

K-means clustering (NHL)

There are either three or five locations of relatively high shot density. The dominant location is immediately in front of the net at \(x=89, y=0\). The secondary locations are at the back corners, near \(x=35, y= \pm 25\). The tertiary locations are between the back corners and the net, near \(x=60, y= \pm 15\); these locations are called ‘the slots’.

Because the values already represent locations in physical space, we will not standardize.

df_shotloc = subset(df_shots, select = c(xcoord, ycoord))

K-means clustering (NHL)

Let’s try a k-means clustering using different values for k and compare the within-cluster sum of squared distances, or WSSD. Recall from Section 9.4 at https://ubc-dsci.github.io/introduction-to-datascience/clustering.html#k-means that we want both a small WSSD and a small number of clusters, which we will find at the ‘elbow’ in the following scree plot.

Unfortunately, there is no well-defined elbow here, but both \(k=3\) and \(k=5\) are good candidates. We will use \(k=5\).

wssd <- rep(NA,9)

for(k in 2:10)
{
   shot_clust <- kmeans(df_shotloc, centers = k)
   wssd[k-1] <- shot_clust$tot.withinss
}

centers <- 2:10
dat <- data.frame(centers, wssd)
 
gr3 <- ggplot(dat, aes(x=centers, y=wssd)) +
        geom_line() + 
        geom_point() +
        xlab("number of clusters") +
        ylab("WSSD")
plot(gr3)
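
Because kmeans starts from random centers, the WSSD values in a loop like this can vary a little from run to run. A possible variation, sketched here on synthetic data rather than df_shotloc, fixes a seed and uses the `nstart` argument of kmeans to keep the best of several random starts. The blob locations and the seed are arbitrary assumptions.

```r
set.seed(2024)  # arbitrary seed, only so the sketch is repeatable

# Synthetic stand-in for df_shotloc: three made-up blobs of points
toy_loc <- data.frame(
  xcoord = c(rnorm(100, 80, 5), rnorm(100, 40, 5), rnorm(100, 40, 5)),
  ycoord = c(rnorm(100, 0, 5),  rnorm(100, 25, 5), rnorm(100, -25, 5))
)

ks <- 2:10
wssd <- sapply(ks, function(k) {
  # nstart = 10 keeps the best of 10 random initialisations
  kmeans(toy_loc, centers = k, nstart = 10)$tot.withinss
})

data.frame(centers = ks, wssd = wssd)
```

With three well-separated blobs the elbow at k = 3 should be obvious; the real shot data are much less tidy, which is why no clear elbow appears there.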

K-means clustering (NHL)

Do the centers of the clusters align with our previous visuals-based intuition? We can find the centers of each of the clusters from the object returned by the function kmeans. We can also find the number of shots that fit into each cluster.

Visually, we found one large cluster near \(x=80, y=0\), two medium-sized clusters at \(x=35, y= \pm 25\), and two small clusters at \(x=60, y= \pm 15\).

set.seed(12345)
shot_clust_5 <- kmeans(df_shotloc, centers = 5)
shot_clust_5$centers
##     xcoord      ycoord
## 1 76.54126  -0.6181961
## 2 38.03774 -18.6196618
## 3 65.72413  22.4948896
## 4 65.78030 -23.5139906
## 5 38.15829  21.0938875
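
Note that k-means cluster numbers are arbitrary labels: a different seed or R version can attach different numbers to the same centers. A sketch of one way to get a stable ordering, using the centers printed above (rounded) and an arbitrary sorting rule of "largest x first, then lowest y":

```r
# Centers copied (rounded) from the printed output above
centers <- data.frame(
  xcoord = c(76.54, 38.04, 65.72, 65.78, 38.16),
  ycoord = c(-0.62, -18.62, 22.49, -23.51, 21.09)
)

# Sort by x rounded to the nearest foot (largest first), then by y.
# The rule is an arbitrary choice made for illustration.
sorted <- centers[order(-round(centers$xcoord), centers$ycoord), ]
sorted$original_label <- rownames(sorted)
rownames(sorted) <- NULL
sorted
```

Describing clusters by their sorted position, or by their coordinates, keeps the write-up valid even if the labels change on a rerun.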

K-means clustering (NHL)

Let’s add the centers as bright numbers to the contour plot to back this up.

shot_centers <- as.data.frame(shot_clust_5$centers)

gr4 <- gr2 + geom_point(data=shot_centers, aes(x=xcoord, y=ycoord), 
                        inherit.aes = FALSE, col="Red", size = 7, pch=as.character(1:5))

plot(gr4)

K-means clustering (NHL)

The centers of the clusters agree with our visual inspection. Cluster 1 represents the shots near the goal net. Clusters 5 and 2 represent the shots by the top and bottom back corners (near \(x=38\)), respectively. Clusters 3 and 4 represent the top and bottom slots (near \(x=66\)), respectively.

What are the relative sizes of these clusters?

shot_clust_5$size
## [1] 21642 11474 11643 11329 10683

K-means clustering (NHL)

The cluster of shots near the goal is about twice as large as each of the other four clusters. From the contour plot, it may seem as though the number of shots near the net should be much larger, but the other four clusters would appear less prominently if they were more diffuse. That is, if corner (2 and 5) and slot (3 and 4) shots came from more varied locations than net (1) shots, they might appear less bright in a contour plot.

We can find relative diffusion by looking at the root mean-squared distance, which is \(RMSD_i = \sqrt{WSSD_i / n_i}\), where \(n_i\) is the size of each cluster, \(WSSD_i\) is the sum of squared distance within each cluster, and \(i = 1, \ldots , k\).

msd <- sqrt(shot_clust_5$withinss / shot_clust_5$size)
msd
## [1] 10.17421 14.98870 12.67287 12.35079 13.93232

K-means clustering (NHL)

The cluster of net shots is the least diffuse, \(RMSD_1 \sim 10 \mathrm{ft}\), followed by the slot shots, \(RMSD_{\{3,4\}} \sim 12.5 \mathrm{ft}\), followed by the corner shots, \(RMSD_{\{2,5\}} \sim 14.5 \mathrm{ft}\).
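
The RMSD formula can be checked by hand: the `withinss` component returned by kmeans should equal the sum of squared distances from each point to its own cluster center. The sketch below does this on made-up points (one tight group, one loose group); the data and seed are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for a repeatable toy example

# Made-up points: a tight group near (0, 0) and a looser group near (10, 0)
pts <- cbind(x = c(rnorm(50, 0, 1), rnorm(50, 10, 3)),
             y = c(rnorm(50, 0, 1), rnorm(50, 0, 3)))
fit <- kmeans(pts, centers = 2, nstart = 5)

# WSSD_i by hand: squared distances from each point to its own cluster center
wssd_by_hand <- sapply(1:2, function(i) {
  members <- pts[fit$cluster == i, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[i, ])^2)
})

# RMSD_i = sqrt(WSSD_i / n_i); the looser group should have the larger value
rmsd <- sqrt(wssd_by_hand / fit$size)
rmsd
```

Because RMSD is in the same units as the coordinates (feet, for the shot data), it is easier to interpret than the raw WSSD.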

Suggested Exercises:

  1. The variable ev.team contains a three-character code of the team that took the shot in question. Use the filter function and the code in this case study to explore the shooting pattern of your favourite team (sorry, no Golden Knights or Kraken in 2016-17). Does your team follow the same shooting patterns as the league as a whole?

  2. Find the distance from each cluster mean to the center of the net, at \(x=89, y=0\).

  3. Repeat questions 1 and 2 for the case where there are 3 clusters rather than 5. Briefly describe the 3 clusters that emerge.

unique(df_shots$ev.team)
##  [1] "TOR" "OTT" "STL" "CHI" "CGY" "EDM" "L.A" "S.J" "MTL" "BUF" "NYR" "NYI"
## [13] "WSH" "PIT" "BOS" "CBJ" "DET" "T.B" "MIN" "CAR" "WPG" "ANA" "DAL" "NSH"
## [25] "PHI" "N.J" "FLA" "COL" "ARI" "VAN"

Hockey rink dimensions chart from https://www.sportsfeelgoodstories.com/hockey-rink-dimensions-size-diagram/

K-means in BDC

Now let’s try contour plotting to visualize the Big Data Cup data points. As before, the brightly coloured regions are the locations of the highest shot density.

gr_b1 <- ggplot(bdc_shots, aes(x = X.Coordinate, y = Y.Coordinate)) +
        geom_density_2d_filled() +
        xlab("x from defensive end (feet)") +
        ylab("y from camera side (feet)")

plot(gr_b1)

K-means in BDC

We can also find cluster centers with kmeans, starting with k=5.

For the men, we found one large cluster near \(x=180, y=42.5\), two medium-sized clusters at \(x=135, y= 42.5 \pm 25\), and two small clusters at \(x=160, y= 42.5 \pm 15\), after adjusting for the BDC coordinate system.
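
The adjustment between the two coordinate systems appears to be a simple shift. The sketch below assumes a 200 ft by 85 ft rink, so that center ice sits at (100, 42.5) in BDC coordinates, and assumes the y-axes point the same way, which has not been verified. `nhl_to_bdc` is a hypothetical helper.

```r
# Hypothetical helper: shift NHL-style coordinates (origin at center ice,
# after the absolute-value flip) into BDC-style coordinates.
# Assumptions: 200 ft x 85 ft rink, center ice at (100, 42.5),
# and y-axes pointing the same direction (not verified).
nhl_to_bdc <- function(x, y) {
  data.frame(X.Coordinate = x + 100, Y.Coordinate = y + 42.5)
}

# Three of the men's locations quoted earlier: (80, 0), (35, 25), (60, 15)
converted <- nhl_to_bdc(x = c(80, 35, 60), y = c(0, 25, 15))
converted
```

These map to (180, 42.5), (135, 67.5), and (160, 57.5), matching the values quoted in the text.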

set.seed(12345)
bdc_shotloc = subset(bdc_shots, select = c(X.Coordinate, Y.Coordinate))
shot_clust_5w <- kmeans(bdc_shotloc, centers = 5)
shot_clust_5w$centers
##   X.Coordinate Y.Coordinate
## 1     142.6986     31.15068
## 2     178.7368     41.83333
## 3     167.9231     61.94872
## 4     167.0132     18.09211
## 5     141.0000     66.95161

K-means in BDC

Try again for k=3. This time the corners disappear and the ‘slot’ shots remain.

set.seed(12345)
shot_clust_3w <- kmeans(bdc_shotloc, centers = 3)
shot_clust_3w$centers
##   X.Coordinate Y.Coordinate
## 1     151.1250     23.59167
## 2     177.6215     43.33898
## 3     148.6038     65.08491

Try and do an animation

Set up all the events on a ggplot

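library(stringr)  # provides str_extract()

# Minute of the game (1-indexed) from the countdown clock, clamped to 1..60
bdc$minute = 20*(bdc$Period - 1) + 20 - as.numeric(str_extract(bdc$Clock, "^[0-9]+"))
bdc$minute = pmax(1, bdc$minute)
bdc$minute = pmin(60, bdc$minute)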
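
For reference, here is a base-R sketch of a clock-to-minute conversion with a few worked values. It uses a 1-indexed convention (clock 19:59 to 19:00 of period 1 is minute 1) and assumes 20-minute periods and a countdown clock in "MM:SS" form; `game_minute` is a hypothetical helper name.

```r
# Hypothetical helper: minute of the game (1-indexed) from period and clock.
# Assumes 20-minute periods and a countdown clock string of the form "MM:SS".
game_minute <- function(period, clock) {
  mins_left <- as.numeric(sub(":.*$", "", clock))  # text before the colon
  minute <- 20 * (period - 1) + 20 - mins_left
  pmin(60, pmax(1, minute))  # clamp 20:00 faceoffs to 1 and overtime to 60
}

worked <- game_minute(period = c(1, 1, 1, 3),
                      clock  = c("20:00", "19:57", "0:05", "0:05"))
worked
```

The four worked values are 1, 1, 20, and 60: the opening faceoff and the first minute share frame 1, and the last minute of regulation is frame 60.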

gr_all <- ggplot(bdc, aes(x = X.Coordinate,  y = Y.Coordinate)) +
  geom_point() +
  xlab("x from defensive end (feet)") +
  ylab("y from camera side (feet)")
  
plot(gr_all)

Try and do an animation

Use gganimate, thanks to https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/

library(gganimate)
gr_anim <- gr_all + transition_time(minute) +
  labs(title = "Minute: {frame_time}")

#animate(gr_anim, fps=5)
gr_anim

Try and do an animation

Let’s try this with a spatial correction. Instead of feet from the defensive end of whichever team has the puck, let’s use feet from the visitors’ end.

idx_team_is_home = which(bdc$Team == bdc$Home.Team)

bdc$X.Coordinate.adj = bdc$X.Coordinate
bdc$X.Coordinate.adj[idx_team_is_home] = 200 - bdc$X.Coordinate[idx_team_is_home]
bdc$Y.Coordinate.adj = bdc$Y.Coordinate
gr_all2 <- ggplot(bdc, aes(x = X.Coordinate.adj,  y = Y.Coordinate.adj)) +
  geom_point() +
  xlab("x from visitor end (feet)") +
  ylab("y from camera side (feet)")
  
plot(gr_all2)
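
A toy example of the same correction makes the mirroring explicit. The two events below are made up: both are recorded 150 feet from the acting team's own end, but one is by the visiting team and one by the home team.

```r
# Toy events (made up): x is measured from the acting team's own end
toy <- data.frame(
  Team         = c("Women - Canada", "Women - United States"),
  Home.Team    = c("Women - United States", "Women - United States"),
  X.Coordinate = c(150, 150)
)

# Mirror x for home-team events so every x is measured from the visitors' end
is_home <- toy$Team == toy$Home.Team
toy$X.Coordinate.adj <- ifelse(is_home, 200 - toy$X.Coordinate, toy$X.Coordinate)

toy
```

After the correction the visitors' event stays at 150 and the home team's event moves to 50, so the two teams' attacking zones no longer overlap on the plot.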

Try and do an animation

Animate with the spatial correction and slow things down

gr_anim2 <- gr_all2 + transition_time(minute) +
  labs(title = "Minute: {frame_time}")

animate(gr_anim2, fps=2, nframes=60)